Image-based virtual try-on aims to replace the clothing in a person image with an in-shop garment image, a task that has attracted increasing attention from the multimedia and computer vision communities. Prior methods successfully preserve the characteristics of the garment image; however, occlusion remains a pernicious obstacle to realistic virtual try-on. In this work, we first present a comprehensive analysis of occlusions and categorize them into two types: i) Inherent-Occlusion: a ghost of the former clothing persists in the try-on image; ii) Acquired-Occlusion: the target garment is warped onto an unreasonable body part. Based on this in-depth analysis, we find that both kinds of occlusion can be simulated by a novel semantically-guided mixup module, which generates semantic-specific occluded images that, together with the try-on images, facilitate training a de-occlusion try-on (DOC-VTON) framework. Specifically, DOC-VTON first performs sharpened semantic parsing on the try-on person. Guided by semantics and a pose prior, textures of various complexities are selectively blended with human parts in a copy-and-paste manner. A Generative Module (GM) is then utilized to synthesize the final try-on image while jointly learning to de-occlude. In comparison to state-of-the-art methods, DOC-VTON achieves better perceptual quality by reducing occlusion effects.
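To make the occlusion-simulation idea concrete, here is a minimal sketch of semantically-guided, copy-and-paste blending: a texture is mixed into the try-on image only at pixels whose parsing labels belong to chosen human parts. The label set and the mixing weight `alpha` are hypothetical placeholders, not the paper's exact module.

```python
import numpy as np

def semantic_mixup_occlusion(person, parsing, texture, target_labels, alpha=0.7):
    """Simulate an occlusion by blending texture onto selected semantic
    regions of a try-on image (copy-and-paste style mixup).

    person:  (H, W, 3) float array in [0, 1], the try-on image
    parsing: (H, W) int array of semantic labels (e.g., arms, torso)
    texture: (H, W, 3) float array supplying the occluding content
    target_labels: parsing labels whose pixels receive the blend
    """
    mask = np.isin(parsing, target_labels).astype(np.float32)[..., None]
    # Convex combination (mixup) restricted to the chosen semantic region.
    occluded = (1.0 - alpha * mask) * person + alpha * mask * texture
    return occluded, mask
```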
This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance on the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR and demonstrate that it achieves performance competitive with a product-level API. The code (https://github.com/OFA-Sys/OFA) and demo (https://modelscope.cn/studios/damo/ofa_ocr_pipeline/summary) are publicly available.
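OFA-OCR itself is distributed through the repository and demo linked above; purely as a generic illustration of "recasting text recognition as image captioning", the sketch below runs a vision-encoder/text-decoder model from Hugging Face transformers (TrOCR is used as a stand-in seq2seq recognizer here, not as OFA-OCR).

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Any image-to-text seq2seq model works for the demonstration.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image = Image.open("text_line.png").convert("RGB")  # a cropped text-line image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)        # caption-style decoding
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```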
Generalist models, which can perform diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. As a hopeful alternative route toward general-purpose AI, existing generalist models are still at an early stage, with limited modality and task coverage. To empower multi-modal task scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively, even in just a single line of code. The system automatically generates task plans from such instructions for training and inference, and it also facilitates multi-task training over diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly diverse example tasks in OFASys, with which we also develop a first-of-its-kind single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves 95% of the average performance of 15 task-finetuned models with only 16% of their parameters, showcasing the performance reliability of the multi-modal task scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys
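The flavor of the declarative interface can be conveyed with a toy parser for an OFASys-style instruction; the slot syntax below mimics the image-captioning example shown in the paper, while the parser itself is illustrative and not the library's real implementation.

```python
import re

def parse_instruction(instruction: str):
    """Toy parser: split an instruction on '->' and collect the
    [MODALITY:name] slots on each side. Illustrative only."""
    source, target = (side.strip() for side in instruction.split("->"))
    slot = re.compile(r"\[([A-Z]+):(\w+)\]")
    return {"input_slots": slot.findall(source),
            "output_slots": slot.findall(target)}

plan = parse_instruction("[IMAGE:img] what does the image describe? -> [TEXT:cap]")
print(plan)  # {'input_slots': [('IMAGE', 'img')], 'output_slots': [('TEXT', 'cap')]}
```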
Prompt tuning has become a new paradigm for model tuning, and it has demonstrated success in natural language pretraining and even vision pretraining. In this work, we explore the transfer of prompt tuning to multimodal pretraining, with a focus on generative multimodal pretrained models rather than contrastive ones. Specifically, we implement prompt tuning on the unified sequence-to-sequence pretrained model adaptive to both understanding and generation tasks. Experimental results demonstrate that lightweight prompt tuning can achieve performance comparable to fine-tuning while surpassing other lightweight tuning methods. Furthermore, in comparison with fine-tuned models, the prompt-tuned models demonstrate improved robustness against adversarial attacks. We further find that experimental factors, including prompt length, prompt depth, and reparameterization, have a great impact on model performance, and thus we empirically provide recommendations for the setup of prompt tuning. Despite the observed advantages, we still find some limitations in prompt tuning, and we correspondingly point out directions for future research. Code is available at \url{https://github.com/OFA-Sys/OFA}
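A minimal sketch of the lightweight tuning recipe described here: learnable prompt vectors are prepended to the frozen model's input embeddings, and only they receive gradients. The prompt length (one of the influential factors the paper identifies) is a hyperparameter; the module below is a generic illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class PrependedPrompt(nn.Module):
    """Learnable prompt vectors prepended to frozen input embeddings;
    during tuning, only self.prompt is updated."""
    def __init__(self, embed_dim: int, prompt_length: int = 64):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_length, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim) from the frozen model
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)
```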
Many existing neural architecture search (NAS) solutions rely on downstream training for architecture evaluation, which requires enormous computation. Considering the large carbon footprint these computations bring, this paper aims to explore a green (namely, environmentally friendly) NAS solution that evaluates architectures without training. Intuitively, gradients, induced by the architecture itself, directly determine convergence and generalization. This motivates us to propose the gradient kernel hypothesis: gradients can be used as a coarse-grained proxy for downstream training to evaluate randomly initialized networks. To support the hypothesis, we conduct a theoretical analysis and find a practical gradient kernel that correlates well with training loss and validation performance. Based on this hypothesis, we propose a new kernel-based architecture search approach, KNAS. Experiments show that KNAS achieves competitive results orders of magnitude faster than the "train-then-test" paradigm on image classification tasks. Furthermore, the extremely low search cost enables wide applications. The searched network also outperforms the strong baseline RoBERTa-large on two text classification tasks. Code is available at \url{https://github.com/Jingjing-NLP/KNAS}
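The gradient kernel hypothesis can be illustrated with a training-free scoring loop: compute per-batch gradients of a randomly initialized network and use the mean of their Gram matrix as a proxy score. This is a sketch of the idea, not the exact kernel derived in the paper.

```python
import torch
import torch.nn as nn

def gradient_kernel_score(model: nn.Module, loss_fn, batches) -> float:
    """Score a randomly initialized architecture without training:
    a higher mean pairwise gradient inner product is taken as a proxy
    for better convergence/generalization after training."""
    grads = []
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()
                       if p.grad is not None])
        grads.append(g)
    G = torch.stack(grads)   # (num_batches, num_params)
    gram = G @ G.T           # pairwise gradient inner products
    return gram.mean().item()
```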
Text-to-image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to advance this problem. We also demonstrate finetuning strategies for various downstream tasks, e.g., style learning, super-resolution, text-image ranking, and fashion design, as well as methods for stabilizing pretraining, e.g., eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work, DALL-E.
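To indicate how text and image live in one token stream, here is a sketch of the VQ tokenization step: encoder features are mapped to the indices of their nearest codebook entries, producing discrete image tokens that a single autoregressive Transformer can model after the text tokens. Shapes and names are illustrative.

```python
import torch

def vq_tokenize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each spatial feature vector to its nearest codebook index,
    turning an image into a sequence of discrete tokens.

    features: (num_positions, dim) float encoder outputs
    codebook: (vocab_size, dim) float learned VQ-VAE codebook
    """
    distances = torch.cdist(features, codebook)  # (num_positions, vocab_size)
    return distances.argmin(dim=1)               # discrete image tokens
```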
This paper presents a simple and effective visual prompting method for adapting pre-trained models to downstream recognition tasks. Our method includes two key designs. First, rather than directly adding the prompt and the image together, we treat the prompt as an extra, independent learnable component. We show that the strategy for reconciling the prompt and the image matters, and find that warping the prompt around a properly shrunk image empirically works best. Second, we re-introduce two "old tricks" commonly used in building transferable adversarial examples, i.e., input diversity and gradient normalization, into visual prompting. These techniques improve optimization and enable the prompt to generalize better. We provide extensive experimental results to demonstrate the effectiveness of our method. Using a CLIP model, our prompting method sets a new record of 82.8% average accuracy across 12 popular classification datasets, substantially surpassing the prior art by +5.6%. It is worth noting that this prompting performance already outperforms linear probing by +2.1% and can even match full fine-tuning on certain datasets. In addition, our prompting method shows competitive performance across different data scales and against distribution shifts. The code is publicly available at https://github.com/UCSC-VLAA/EVP.
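A minimal sketch of the first design (prompt warped around a shrunk image): the input is resized onto a smaller canvas and a learnable border fills the remaining pixels, so prompt and image never overlap. The pad width below is a hypothetical hyperparameter; the official code is at the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BorderVisualPrompt(nn.Module):
    """Shrink the image and let a learnable prompt occupy the border,
    keeping prompt and image pixels independent (no additive overlap)."""
    def __init__(self, image_size: int = 224, pad: int = 16):
        super().__init__()
        self.inner = image_size - 2 * pad
        self.pad = pad
        self.prompt = nn.Parameter(torch.zeros(3, image_size, image_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 3, H, W) -> shrink, then paste into the prompt canvas.
        x = F.interpolate(x, size=(self.inner, self.inner),
                          mode="bilinear", align_corners=False)
        canvas = self.prompt.unsqueeze(0).expand(x.size(0), -1, -1, -1).clone()
        canvas[:, :, self.pad:self.pad + self.inner,
               self.pad:self.pad + self.inner] = x
        return canvas
```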
Real-world image denoising is a practical image restoration problem that aims to obtain clean images from in-the-wild noisy inputs. Recently, the Vision Transformer (ViT) has exhibited a strong ability to capture long-range dependencies, and many researchers have attempted to apply ViT to image denoising tasks. However, a real-world image is an isolated frame, and making ViT build long-range dependencies over internal patches divides the image into patches and disarranges the noise pattern and gradient continuity. In this paper, we propose to solve this problem with a continuous wavelet sliding transformer that builds frequency correspondences under real-world scenes, called DnSwin. Specifically, we first extract bottom features from noisy input images using a CNN encoder. The key to DnSwin is separating high-frequency and low-frequency information from the observed features and building frequency dependencies. To this end, we propose a Wavelet Sliding Window Transformer, which utilizes the discrete wavelet transform, self-attention, and the inverse discrete wavelet transform to extract deep features. Finally, we reconstruct the deep features into denoised images using a CNN decoder. Both quantitative and qualitative evaluations on real-world denoising benchmarks demonstrate that the proposed DnSwin performs favorably against the state-of-the-art methods.
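The frequency separation at the core of DnSwin can be sketched with an off-the-shelf 2-D discrete wavelet transform: one low-frequency band and three high-frequency bands are split off, processed (by attention in the real model), and reassembled by the inverse transform. The snippet below only demonstrates the split/merge step, not the full architecture.

```python
import numpy as np
import pywt

def wavelet_split_merge(feature_map: np.ndarray) -> np.ndarray:
    """Split a 2-D map into low/high-frequency wavelet bands and
    reconstruct it; DnSwin interposes transformer blocks between
    the two transforms to build frequency-wise dependencies."""
    low, (lh, hl, hh) = pywt.dwt2(feature_map, "haar")   # frequency split
    # ... attention over the bands would run here in the real model ...
    return pywt.idwt2((low, (lh, hl, hh)), "haar")       # reconstruction
```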
Vision-language pretraining (VLP) models have achieved state-of-the-art performance on numerous cross-modal tasks. Since they are optimized to capture the statistical properties of intra- and inter-modality data, there remains a risk that they learn the social biases presented in the data. In this work, we (1) introduce a counterfactual-based bias measurement, \emph{CounterBias}, to quantify social bias in VLP models by comparing the [MASK]ed prediction probabilities of factual and counterfactual samples; (2) construct a novel VL-Bias dataset comprising 24K image-text pairs for measuring gender bias in VLP models, from which we observe that significant gender bias is prevalent in VLP models; and (3) propose a VLP debiasing method, \emph{FairVLP}, to minimize the difference in [MASK]ed prediction probabilities between factual and counterfactual image-text pairs. Although CounterBias and FairVLP focus on social bias, they can serve as tools and provide new insights for probing and regularizing the knowledge in VLP models.
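Both the measurement and the debiasing objective reduce to comparing masked prediction probabilities across a factual/counterfactual pair, which can be sketched as follows; `masked_logprob` is a hypothetical callable standing in for the VLP model's scoring of the [MASK]ed position.

```python
def counterbias_gap(masked_logprob, image, factual_text, counterfactual_text):
    """CounterBias-style measurement (sketch): the gap in [MASK]ed
    prediction log-probability between a factual image-text pair and
    its counterfactual (e.g., gender-swapped) version; averaged over
    a dataset, a large gap signals social bias."""
    return (masked_logprob(image, factual_text)
            - masked_logprob(image, counterfactual_text))

def fairvlp_loss(masked_logprob, image, factual_text, counterfactual_text):
    """FairVLP-style debiasing objective (sketch): penalize that gap
    so factual and counterfactual pairs are scored alike."""
    return counterbias_gap(masked_logprob, image, factual_text,
                           counterfactual_text) ** 2
```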
Network inference has been extensively studied in several fields, such as systems biology and the social sciences. Learning network topology and internal dynamics is essential to understanding the mechanisms of complex systems. In particular, sparse topologies and stable dynamics are fundamental features of many real-world continuous-time (CT) networks. Given that usually only a partial set of nodes can be observed, in this paper we consider linear CT systems to depict networks, since they can model unmeasured nodes via transfer functions. In addition, measurements tend to be noisy and sampled at low and varying frequencies, so we adopt CT models, since discrete-time approximations often require fine-grained measurements and uniform sampling steps. The developed method applies dynamical structure functions (DSFs), derived from linear stochastic differential equations (SDEs), to describe networks of measured nodes. A numerical sampling method, preconditioned Crank-Nicolson (pCN), is used to refine coarse-grained trajectories and improve inference accuracy. The convergence of the developed method is robust to the dimension of the data source. Monte Carlo simulations indicate that the developed method outperforms state-of-the-art methods, including group sparse Bayesian learning (GSBL), BINGO, kernel-based methods, dynGENIE3, GENIE3, and ARNI. The simulations include random and ring networks as well as a synthetic biological network. These are challenging networks, suggesting that the developed method can be applied in a wide range of settings, such as gene regulatory networks, social networks, and communication systems.
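The pCN sampler mentioned above has a particularly compact form. Assuming a standard Gaussian prior on the latent trajectory, the proposal leaves the prior invariant, so the accept/reject step depends only on the likelihood ratio; this is what makes the method robust to the dimension of the data source. A minimal sketch:

```python
import numpy as np

def pcn_step(x, log_likelihood, beta=0.2, rng=None):
    """One preconditioned Crank-Nicolson update under a standard
    Gaussian prior: propose sqrt(1 - beta^2) * x + beta * xi and
    accept with probability min(1, exp(L(x') - L(x)))."""
    rng = rng or np.random.default_rng()
    proposal = np.sqrt(1.0 - beta**2) * x + beta * rng.standard_normal(x.shape)
    if np.log(rng.uniform()) < log_likelihood(proposal) - log_likelihood(x):
        return proposal, True   # accepted
    return x, False             # rejected
```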